Introduction¶

Background¶

This project explores public health data from Open Food Facts. The dataset contains information about food products available in different countries, including their brands, nutrition grades, and ingredients. The dataset is available at https://world.openfoodfacts.org/

Importing relevant libraries¶

Loading dataset and preprocessing¶

code url creator created_t created_datetime last_modified_t last_modified_datetime product_name generic_name quantity ... ph_100g fruits-vegetables-nuts_100g collagen-meat-protein-ratio_100g cocoa_100g chlorophyl_100g carbon-footprint_100g nutrition-score-fr_100g nutrition-score-uk_100g glycemic-index_100g water-hardness_100g
0 3087 http://world-fr.openfoodfacts.org/produit/0000... openfoodfacts-contributors 1474103866 2016-09-17T09:17:46Z 1474103893 2016-09-17T09:18:13Z Farine de blé noir NaN 1kg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4530 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Banana Chips Sweetened (Whole) NaN NaN ... NaN NaN NaN NaN NaN NaN 14.0 14.0 NaN NaN
2 4559 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Peanuts NaN NaN ... NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN

3 rows × 162 columns

code url creator created_t created_datetime last_modified_t last_modified_datetime product_name generic_name quantity ... ph_100g fruits-vegetables-nuts_100g collagen-meat-protein-ratio_100g cocoa_100g chlorophyl_100g carbon-footprint_100g nutrition-score-fr_100g nutrition-score-uk_100g glycemic-index_100g water-hardness_100g
320769 9970229501521 http://world-fr.openfoodfacts.org/produit/9970... tomato 1422099377 2015-01-24T11:36:17Z 1491244499 2017-04-03T18:34:59Z 乐吧泡菜味薯片 Leba pickle flavor potato chips 50 g ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
320770 9980282863788 http://world-fr.openfoodfacts.org/produit/9980... openfoodfacts-contributors 1492340089 2017-04-16T10:54:49Z 1492340089 2017-04-16T10:54:49Z Tomates aux Vermicelles NaN 67g ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
320771 999990026839 http://world-fr.openfoodfacts.org/produit/9999... usda-ndb-import 1489072709 2017-03-09T15:18:29Z 1491244499 2017-04-03T18:34:59Z Sugar Free Drink Mix, Peach Tea NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 162 columns

(320772, 162)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320772 entries, 0 to 320771
Columns: 162 entries, code to water-hardness_100g
dtypes: float64(106), object(56)
memory usage: 396.5+ MB
code                       0.000072
url                        0.000072
creator                    0.000006
created_t                  0.000009
created_datetime           0.000028
                             ...   
carbon-footprint_100g      0.999165
nutrition-score-fr_100g    0.310382
nutrition-score-uk_100g    0.310382
glycemic-index_100g        1.000000
water-hardness_100g        1.000000
Length: 162, dtype: float64
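
The per-column values above look like the fraction of missing entries in each column. A minimal sketch of how such a summary can be produced with pandas follows; the small `toy` frame is a hypothetical stand-in for the full table, and the notebook's own code may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Open Food Facts table.
toy = pd.DataFrame({
    "code": ["3087", "4530", "4559", None],
    "product_name": ["Farine de blé noir", "Banana Chips", "Peanuts", "Drink Mix"],
    "nutrition-score-fr_100g": [np.nan, 14.0, 0.0, np.nan],
    "water-hardness_100g": [np.nan, np.nan, np.nan, np.nan],
})

# isnull() marks missing cells; the column-wise mean of that boolean
# mask is the fraction of missing values in each column.
null_fraction = toy.isnull().mean()

print(toy.shape)
print(null_fraction)
```

A value of 1.0 means the column is entirely empty, which is what the output above reports for `glycemic-index_100g` and `water-hardness_100g`.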

Handling missing values¶

                                       %null  Col
water-hardness_100g                      1.0  water-hardness_100g
no_nutriments                            1.0  no_nutriments
ingredients_that_may_be_from_palm_oil    1.0  ingredients_that_may_be_from_palm_oil
nutrition_grade_uk                       1.0  nutrition_grade_uk
nervonic-acid_100g                       1.0  nervonic-acid_100g
[Figure: Plotly chart of the fraction of missing values (%null, from 0 to 1) for each column (Col).]
                            %null  Col
fiber_100g               0.373742  fiber_100g
serving_size             0.341180  serving_size
nutrition-score-uk_100g  0.310382  nutrition-score-uk_100g
nutrition-score-fr_100g  0.310382  nutrition-score-fr_100g
nutrition_grade_fr       0.310382  nutrition_grade_fr
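
The two tables in this section rank columns by their share of missing values. One way to build such a table and use it to keep only the better-populated columns is sketched below. The `toy` frame and the `MAX_NULL_FRACTION` cut-off are assumptions made for illustration; the threshold the notebook actually used is not shown in the output.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names borrowed from the dataset.
toy = pd.DataFrame({
    "product_name": ["a", "b", "c", "d"],
    "fiber_100g": [3.6, np.nan, 7.1, 9.4],
    "nutrition_grade_fr": ["d", "b", None, None],
    "water-hardness_100g": [np.nan] * 4,
})

# Build a table with the same two columns as the output above.
null_table = toy.isnull().mean().rename("%null").to_frame()
null_table["Col"] = null_table.index
null_table = null_table.sort_values("%null", ascending=False)

# Assumed cut-off: keep columns with at most half of their values missing.
MAX_NULL_FRACTION = 0.5
keep = null_table.loc[null_table["%null"] <= MAX_NULL_FRACTION, "Col"].tolist()
relevant = toy[keep]

print(null_table)
print(relevant.columns.tolist())
```

Sorting in descending order puts the emptiest columns first, which makes it easy to see which ones carry too little data to be useful.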

Dataframe with relevant columns¶

code url creator created_t created_datetime last_modified_t last_modified_datetime product_name brands brands_tags ... fat_100g saturated-fat_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g nutrition-score-fr_100g nutrition-score-uk_100g
0 3087 http://world-fr.openfoodfacts.org/produit/0000... openfoodfacts-contributors 1474103866 2016-09-17T09:17:46Z 1474103893 2016-09-17T09:18:13Z Farine de blé noir Ferme t'y R'nao ferme-t-y-r-nao ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4530 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Banana Chips Sweetened (Whole) NaN NaN ... 28.57 28.57 64.29 14.29 3.6 3.57 0.000 0.00 14.0 14.0
2 4559 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Peanuts Torn & Glasser torn-glasser ... 17.86 0.00 60.71 17.86 7.1 17.86 0.635 0.25 0.0 0.0

3 rows × 34 columns

additives_n ingredients_from_palm_oil_n ingredients_that_may_be_from_palm_oil_n energy_100g fat_100g saturated-fat_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g nutrition-score-fr_100g nutrition-score-uk_100g
count 248939.000000 248939.000000 248939.000000 2.611130e+05 243891.000000 229554.000000 243588.000000 244971.000000 200886.000000 259922.000000 255510.000000 255463.000000 221210.000000 221210.000000
mean 1.936024 0.019659 0.055246 1.141915e+03 12.730379 5.129932 32.073981 16.003484 2.862111 7.075940 2.028624 0.798815 9.165535 9.058049
std 2.502019 0.140524 0.269207 6.447154e+03 17.578747 8.014238 29.731719 22.327284 12.867578 8.409054 128.269454 50.504428 9.055903 9.183589
min 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 -17.860000 -6.700000 -800.000000 0.000000 0.000000 -15.000000 -15.000000
25% 0.000000 0.000000 0.000000 3.770000e+02 0.000000 0.000000 6.000000 1.300000 0.000000 0.700000 0.063500 0.025000 1.000000 1.000000
50% 1.000000 0.000000 0.000000 1.100000e+03 5.000000 1.790000 20.600000 5.710000 1.500000 4.760000 0.581660 0.229000 10.000000 9.000000
75% 3.000000 0.000000 0.000000 1.674000e+03 20.000000 7.140000 58.330000 24.000000 3.600000 10.000000 1.374140 0.541000 16.000000 16.000000
max 31.000000 2.000000 6.000000 3.251373e+06 714.290000 550.000000 2916.670000 3520.000000 5380.000000 430.000000 64312.800000 25320.000000 40.000000 40.000000
(320772, 34)
additives_n ingredients_from_palm_oil_n energy_100g saturated-fat_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g nutrition-score-fr_100g
count 248939.000000 248939.000000 2.611130e+05 229554.000000 243588.000000 244971.000000 200886.000000 259922.000000 255510.000000 221210.000000
mean 1.936024 0.019659 1.141915e+03 5.129932 32.073981 16.003484 2.862111 7.075940 2.028624 9.165535
std 2.502019 0.140524 6.447154e+03 8.014238 29.731719 22.327284 12.867578 8.409054 128.269454 9.055903
min 0.000000 0.000000 0.000000e+00 0.000000 0.000000 -17.860000 -6.700000 -800.000000 0.000000 -15.000000
25% 0.000000 0.000000 3.770000e+02 0.000000 6.000000 1.300000 0.000000 0.700000 0.063500 1.000000
50% 1.000000 0.000000 1.100000e+03 1.790000 20.600000 5.710000 1.500000 4.760000 0.581660 10.000000
75% 3.000000 0.000000 1.674000e+03 7.140000 58.330000 24.000000 3.600000 10.000000 1.374140 16.000000
max 31.000000 2.000000 3.251373e+06 550.000000 2916.670000 3520.000000 5380.000000 430.000000 64312.800000 40.000000
code brands countries_tags additives_n ingredients_from_palm_oil_n nutrition_grade_fr energy_100g saturated-fat_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g nutrition-score-fr_100g
0 4530 NA en:united-states 0.0 No d 2243.0 28.57 64.29 14.29 3.6 3.57 0.00000 14.0
1 4559 Torn & Glasser en:united-states 0.0 No b 1941.0 0.00 60.71 17.86 7.1 17.86 0.63500 0.0
2 16087 Grizzlies en:united-states 0.0 No d 2540.0 5.36 17.86 3.57 7.1 17.86 1.22428 12.0
        additives_n  energy_100g  saturated-fat_100g  carbohydrates_100g  sugars_100g  fiber_100g  proteins_100g  salt_100g  nutrition-score-fr_100g
0               0.0       2243.0               28.57               64.29        14.29        3.60          3.570    0.00000                     14.0
1               0.0       1941.0                0.00               60.71        17.86        7.10         17.860    0.63500                      0.0
2               0.0       2540.0                5.36               17.86         3.57        7.10         17.860    1.22428                     12.0
3               2.0       1833.0                4.69               57.81        15.62        9.40         14.060    0.13970                      7.0
4               1.0       2230.0                5.00               36.67         3.33        6.70         16.670    1.60782                     12.0
...             ...          ...                 ...                 ...          ...         ...            ...        ...                      ...
170705          5.0       1031.0                1.28               95.31         0.10        1.47          0.004    0.00100                      2.0
170706          1.0       1393.0                2.78               61.11        30.56        8.30          5.560    0.95250                     11.0
170707          0.0       1477.0                0.00               87.06         2.35        4.70          1.180    0.03048                     -1.0
170708          0.0         21.0                0.20                0.50         0.50        0.20          0.500    0.02540                      2.0
170709          0.0          0.0                0.00                0.00         0.00        0.00          0.000    0.00000                      0.0

170710 rows × 9 columns

         additives_n   energy_100g  saturated-fat_100g  carbohydrates_100g    sugars_100g     fiber_100g  proteins_100g      salt_100g  nutrition-score-fr_100g
count  170710.000000  1.707100e+05       170710.000000       170710.000000  170710.000000  170710.000000  170710.000000  170710.000000            170710.000000
mean        1.967916  1.206842e+03            4.635023           34.584268      14.994453       2.865459       7.750171       1.375087                 8.796579
std         2.517014  7.903058e+03            6.980362           28.227493      19.421789       4.403977       7.953934      14.621370                 9.077207
min         0.000000  0.000000e+00            0.000000            0.000000     -17.860000       0.000000      -3.570000       0.000000               -15.000000
25%         0.000000  4.520000e+02            0.000000            8.000000       1.500000       0.000000       2.100000       0.116840                 1.000000
50%         1.000000  1.218000e+03            1.670000           26.670000       5.300000       1.600000       5.670000       0.678180                 9.000000
75%         3.000000  1.745000e+03            6.800000           60.200000      23.330000       3.600000      10.710000       1.361440                16.000000
max        31.000000  3.251373e+06          210.000000          209.380000     134.000000     178.000000     100.000000    3048.000000                40.000000
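
The summary above covers 170,710 rows with no missing values, yet it still contains entries that cannot be physical, such as negative sugars and an energy maximum above three million. The sketch below shows one possible way to drop incomplete rows and then flag implausible ones. It is not taken from the notebook: the `toy` frame, the 0 to 100 g range for per-100 g columns, and the 3800 kJ energy ceiling (roughly the energy of pure fat) are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the last two mimic the outliers seen above.
toy = pd.DataFrame({
    "energy_100g": [2243.0, 1941.0, np.nan, 3251373.0, 1833.0],
    "sugars_100g": [14.29, 17.86, 3.57, 10.0, -17.86],
    "proteins_100g": [3.57, 17.86, 17.86, 5.0, 14.06],
})

# Step 1: keep only rows where every selected column has a value.
complete = toy.dropna().reset_index(drop=True)

# Step 2 (assumption): grams per 100 g must lie between 0 and 100.
gram_cols = ["sugars_100g", "proteins_100g"]
plausible = (
    complete[gram_cols].ge(0).all(axis=1)
    & complete[gram_cols].le(100).all(axis=1)
)
# Assumed ceiling for energy in kJ per 100 g.
plausible &= complete["energy_100g"].between(0, 3800)

cleaned = complete[plausible].reset_index(drop=True)
print(len(toy), len(complete), len(cleaned))
```

Filtering like this matters for the later steps, because standardization, PCA, and K-means are all sensitive to extreme values.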

Data Visualization¶

Visualization of numerical columns¶

Visualization of categorical columns¶

ANOVA test¶

nutrition_count  nutrition_a  nutrition_b  nutrition_c  nutrition_d  nutrition_e
0                     1096.0       1941.0       1833.0       2243.0       2092.0
1                     1887.0       1824.0       1674.0       2540.0       2197.0
2                     1904.0       2632.0       1954.0       2230.0       1883.0
3                     1749.0       1548.0       1941.0       1464.0       2197.0
4                      159.0       1674.0       1904.0       2092.0       1569.0

ANOVA test result¶

89.92095963378982
1.806909715123044e-76
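
The two numbers above appear to be the F statistic and the p-value of a one-way ANOVA, comparing what look like energy values across the nutrition grades a to e. A minimal sketch with `scipy.stats.f_oneway` follows; the three synthetic groups are hypothetical and only illustrate the call, they do not reproduce the notebook's result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical energy values (kJ per 100 g) for three nutrition grades.
groups = {
    "a": rng.normal(900, 300, 50),
    "c": rng.normal(1300, 300, 50),
    "e": rng.normal(1900, 300, 50),
}

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = stats.f_oneway(*groups.values())

print(f_stat, p_value)
```

A large F statistic with a p-value far below 0.05, as in the output above, suggests that mean energy differs between at least two of the grades. It does not say which ones.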

PCA¶

array([[-0.78184784,  1.40053847,  3.43791013, ..., -0.52556421,
        -0.09403702,  0.57333893],
       [-0.78184784,  1.00000986, -0.66552251, ...,  1.2710571 ,
        -0.05060909, -0.96903005],
       [-0.78184784,  1.79443582,  0.10431995, ...,  1.2710571 ,
        -0.01030798,  0.3530005 ],
       ...,
       [-0.78184784,  0.38462815, -0.66552251, ..., -0.82604881,
        -0.09195248, -1.07919926],
       [-0.78184784, -1.54639722, -0.63679704, ..., -0.91154233,
        -0.09229991, -0.74869162],
       [-0.78184784, -1.57424854, -0.66552251, ..., -0.97440522,
        -0.09403702, -0.96903005]])
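
The array above looks like the nine numeric columns after standardization: the first entry, -0.7818, is consistent with an `additives_n` value of 0 rescaled by that column's mean (1.967916) and standard deviation (2.517014). A sketch of this step using scikit-learn's `StandardScaler` is shown here, on a small hypothetical matrix rather than the real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: additives_n, energy_100g, saturated-fat_100g.
X = np.array([
    [0.0, 2243.0, 28.57],
    [0.0, 1941.0, 0.00],
    [2.0, 1833.0, 4.69],
    [1.0, 2230.0, 5.00],
])

# Rescale every column to zero mean and unit variance so that
# large-valued columns such as energy do not dominate PCA or K-means.
scaled = StandardScaler().fit_transform(X)

print(scaled)
```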

PCA results¶

PCA(n_components=4)
array([0.31020206, 0.18258611, 0.153259  , 0.11133615])
array([0.31020206, 0.49278817, 0.64604718, 0.75738332])
array([ 0.10280092,  0.52236288,  0.38150724,  0.38947938,  0.39423053,
        0.10101489,  0.09058342, -0.00297288,  0.4954706 ])
array([-0.3573938 ,  0.21887959,  0.30557293, -0.29689876, -0.43302178,
        0.30035385,  0.60365517,  0.02248405,  0.01457077])
array([-0.1720052 ,  0.11640431, -0.40173669,  0.46440778,  0.10245973,
        0.65433582,  0.0470006 , -0.06691056, -0.36668495])
array([ 9.81346784e-02, -1.89275569e-04, -8.22712122e-02,  4.76224266e-02,
       -2.53504418e-02,  5.11189066e-02,  4.24901372e-02,  9.87970300e-01,
        1.36597580e-02])
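
Read in order, the outputs above appear to be the fitted estimator, the explained variance ratio of each of the four components, its cumulative sum (about 0.757 after four components), and the four rows of component loadings over the nine input columns. The sketch below produces the same kinds of output on random data; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 9 features, two of them correlated.
X = rng.normal(size=(200, 9))
X[:, 1] += 2 * X[:, 0]
scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=4)
scores = pca.fit_transform(scaled)        # rows projected onto 4 components

ratio = pca.explained_variance_ratio_     # share of variance per component
cumulative = np.cumsum(ratio)             # running total of explained variance
loadings = pca.components_                # shape (4, 9): one row per component

print(ratio)
print(cumulative)
print(loadings[0])
```

Each row of `loadings` has unit length, so a large absolute entry marks a feature that contributes strongly to that component. In the output above, for instance, the fourth component is dominated by a single entry of about 0.988.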

K-means clustering¶

   additives_n  energy_100g  saturated-fat_100g  carbohydrates_100g  sugars_100g  fiber_100g  proteins_100g  salt_100g  nutrition-score-fr_100g  Category
0          0.0       2243.0               28.57               64.29        14.29         3.6           3.57    0.00000                     14.0  bad
1          0.0       1941.0                0.00               60.71        17.86         7.1          17.86    0.63500                      0.0  good
2          0.0       2540.0                5.36               17.86         3.57         7.1          17.86    1.22428                     12.0  worst
3          2.0       1833.0                4.69               57.81        15.62         9.4          14.06    0.13970                      7.0  good
4          1.0       2230.0                5.00               36.67         3.33         6.7          16.67    1.60782                     12.0  good

K-means results¶

Counter({2: 34462, 3: 32980, 0: 21238, 1: 64500, 4: 17506, 5: 8})
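
The `Counter` above reports how many rows fall into each cluster label. Six labels (0 to 5) appear, and label 5 holds only 8 rows, which would be consistent with a six-cluster fit in which a handful of extreme rows form their own group. The following sketch shows the general pattern with scikit-learn's `KMeans` on synthetic, well-separated points; the data, the choice of three clusters, and the fixed seed are assumptions for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: three tight blobs of 30 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# n_init restarts guard against a poor random initialization.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Count how many points were assigned to each cluster label.
counts = Counter(labels.tolist())
print(counts)
```

Cluster labels are arbitrary integers, so mapping them to names such as "good" or "bad" requires inspecting the cluster centers first.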